Independent Component Analysis (ICA)#
ICA is a method for blind source separation: when you only observe mixtures, can you recover the original sources?
Think: “unmixing voices at a cocktail party.” Multiple speakers talk at once, each microphone records a different linear mixture, and ICA tries to recover the individual voices.
Learning goals#
explain ICA’s core assumptions (independence + non-Gaussianity)
distinguish independence vs uncorrelatedness with a counterexample
understand why non-Gaussianity is the signal ICA uses (CLT intuition)
implement whitening and see its effect on covariance
run ICA (FastICA) and compare with PCA visually (Plotly)
Prerequisites#
covariance / correlation
eigenvalues & eigenvectors (PCA-style intuition)
basic probability (Gaussian vs non-Gaussian)
Notation (linear mixing model)#
Sources (unknown): \(S \in \mathbb{R}^{n\times k}\)
Mixing matrix (unknown): \(A \in \mathbb{R}^{m\times k}\)
Observations (what we measure): \(X \in \mathbb{R}^{n\times m}\)
We assume linear instantaneous mixing:
\[X = S A^\top\]
ICA tries to estimate an unmixing matrix \(W\) so that:
\[\hat{S} = X W^\top \approx S\]
In practice, \(\hat{S}\) is only identifiable up to permutation (ordering) and scaling/sign.
Table of contents#
Intuition: cocktail party unmixing
Statistical foundation
independence vs uncorrelated
non-Gaussianity (CLT intuition)
kurtosis / (approx) negentropy
Algorithm mechanics
whitening
maximizing independence (FastICA idea)
Plotly visualizations
mixed vs unmixed signals
PCA vs ICA comparison
Use cases (signal processing, finance, neuroscience)
Exercises + references
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.decomposition import FastICA, PCA
from sklearn.preprocessing import StandardScaler
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Intuition — “unmixing voices at a cocktail party”#
Imagine 3 people talking at the same time and 3 microphones in the room.
Each microphone records a different mixture of the 3 voices.
You only get the microphone recordings \(X\).
You want to recover the original voices \(S\) without knowing the room acoustics \(A\).
In a toy setting we can simulate this by:
choosing three independent-looking source signals
mixing them with a random matrix
attempting to recover the sources from the mixtures
# Synthetic sources (toy "voices")
n_samples = 2500
t = np.linspace(0.0, 8.0, n_samples)
s1 = np.sin(2 * np.pi * 1.0 * t)
s2 = np.where(np.sin(2 * np.pi * 3.0 * t) >= 0.0, 1.0, -1.0) # square-ish
s3 = 2.0 * ((t * 0.7) % 1.0) - 1.0 # sawtooth-ish ramp in [-1, 1)
s3 = s3 + 0.10 * rng.normal(size=n_samples)
S = np.c_[s1, s2, s3]
S = StandardScaler().fit_transform(S) # each source: zero mean, unit variance
# Random mixing matrix ("room")
A = rng.normal(size=(3, 3))
while abs(np.linalg.det(A)) < 1e-3:  # re-draw if nearly singular
    A = rng.normal(size=(3, 3))
X = S @ A.T
X = StandardScaler().fit_transform(X) # each observed channel: zero mean, unit variance
fig = make_subplots(
    rows=3,
    cols=2,
    shared_xaxes=True,
    vertical_spacing=0.06,
    column_titles=("Sources (unknown)", "Observed mixtures (microphones)"),
)
for i in range(3):
    fig.add_trace(go.Scatter(x=t, y=S[:, i], mode="lines", line=dict(width=1)), row=i + 1, col=1)
    fig.add_trace(go.Scatter(x=t, y=X[:, i], mode="lines", line=dict(width=1)), row=i + 1, col=2)
for i in range(3):
    fig.update_yaxes(title_text=f"s{i + 1}", row=i + 1, col=1)
    fig.update_yaxes(title_text=f"x{i + 1}", row=i + 1, col=2)
fig.update_xaxes(title_text="time", row=3, col=1)
fig.update_xaxes(title_text="time", row=3, col=2)
fig.update_layout(title="Toy cocktail party: sources vs mixtures", showlegend=False, height=650)
fig.show()
Two important “gotchas” that are features, not bugs:
Permutation ambiguity: ICA doesn’t know which source is “voice 1” vs “voice 2”.
Scaling/sign ambiguity: multiplying a source by 2 and dividing the corresponding column of \(A\) by 2 leaves \(X\) unchanged. Signs can flip too.
So success looks like recovering the shapes, not recovering a canonical ordering or amplitude.
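These ambiguities are easy to check numerically: rescaling one source while inversely rescaling the matching column of \(A\), or permuting the sources together with the columns of \(A\), leaves the observations \(X\) exactly unchanged. A standalone sketch with its own toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(100, 2))   # two toy sources
A = rng.normal(size=(2, 2))     # mixing matrix
X = S @ A.T                     # observations

# Scale source 0 by 2 and divide column 0 of A by 2: X is unchanged.
S2 = S.copy(); S2[:, 0] *= 2.0
A2 = A.copy(); A2[:, 0] /= 2.0
assert np.allclose(S2 @ A2.T, X)

# Permute the sources and the columns of A together: X is unchanged.
perm = [1, 0]
assert np.allclose(S[:, perm] @ A[:, perm].T, X)
```

Since every such rescaled or permuted pair \((S, A)\) produces identical data, no algorithm can distinguish them; only the waveform shapes are identifiable.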
2) Statistical foundation#
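2.1 Independence vs uncorrelatedness#
Uncorrelated only means there is no linear relationship; independence rules out any relationship at all. The classic counterexample: if \(Z \sim \mathcal{N}(0, 1)\), then \(Z\) and \(Z^2\) are uncorrelated (because \(\mathrm{Cov}(Z, Z^2) = \mathbb{E}[Z^3] = 0\)) yet completely dependent. PCA only removes correlation; ICA targets full independence. A quick numerical check (standalone, with its own simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
w = z**2  # a deterministic function of z: maximally dependent

# Near-zero sample correlation, since Cov(z, z^2) = E[z^3] = 0.
corr = np.corrcoef(z, w)[0, 1]
print(f"corr(z, z^2) = {corr:.4f}")

# ...but clearly dependent: conditioning on |z| shifts the distribution of w.
print("E[w]           =", round(w.mean(), 3))
print("E[w | |z| > 2] =", round(w[np.abs(z) > 2].mean(), 3))
```

Knowing \(z\) pins down \(w\) exactly, so the pair is as dependent as possible despite the near-zero correlation.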
2.2 Non-Gaussianity (why ICA has signal to grab)#
A key intuition comes from the Central Limit Theorem (CLT):
A sum of (many) independent random variables tends to look more Gaussian.
Mixing independent sources produces observations that are sums of sources, so the mixtures often look more Gaussian than the original signals.
ICA works by finding directions (linear combinations) that look maximally non-Gaussian, because those directions are good candidates for the original sources.
fig = go.Figure()
fig.add_trace(go.Histogram(x=S[:, 1], name="source s2 (square-ish)", nbinsx=60, opacity=0.65))
fig.add_trace(go.Histogram(x=X[:, 0], name="mixture x1", nbinsx=60, opacity=0.65))
fig.update_layout(
    barmode="overlay",
    title="Mixing tends to look 'more Gaussian' (CLT intuition)",
    xaxis_title="value",
    yaxis_title="count",
)
fig.show()
2.3 Kurtosis / (approx) negentropy#
Non-Gaussianity can be quantified in many ways. Two common ideas:
Excess kurtosis: for a standardized variable, \(\kappa = \mathbb{E}[Z^4] - 3\). A Gaussian has \(\kappa = 0\).
Negentropy: how far a distribution’s entropy is from a Gaussian with the same variance. (Hard to compute exactly; ICA often uses proxies.)
FastICA (and many ICA variants) use a contrast function that increases when a projected signal deviates from Gaussian.
def standardize_1d(x: np.ndarray) -> np.ndarray:
    x = np.asarray(x)
    x = x - x.mean()
    return x / (x.std() + 1e-12)

def excess_kurtosis(x: np.ndarray) -> float:
    z = standardize_1d(x)
    return float(np.mean(z**4) - 3.0)

def negentropy_proxy(x: np.ndarray, n_ref: int = 20000, seed: int = 0) -> float:
    """A simple negentropy proxy using the log-cosh contrast.

    This is *not* an exact negentropy computation; it's an intuitive scalar that
    increases when a distribution deviates from Gaussian.
    """
    z = standardize_1d(x)
    rng_local = np.random.default_rng(seed)
    v = rng_local.standard_normal(size=n_ref)

    def G(u):
        u = np.clip(u, -10.0, 10.0)
        return np.log(np.cosh(u))

    return float((np.mean(G(z)) - np.mean(G(v))) ** 2)
labels = ["1", "2", "3"]
kurt_S = [excess_kurtosis(S[:, i]) for i in range(3)]
kurt_X = [excess_kurtosis(X[:, i]) for i in range(3)]
neg_S = [negentropy_proxy(S[:, i], seed=10 + i) for i in range(3)]
neg_X = [negentropy_proxy(X[:, i], seed=20 + i) for i in range(3)]
fig = make_subplots(rows=1, cols=2, subplot_titles=("Excess kurtosis", "Negentropy proxy (log-cosh)"))
fig.add_trace(go.Bar(x=labels, y=kurt_S, name="sources"), row=1, col=1)
fig.add_trace(go.Bar(x=labels, y=kurt_X, name="mixtures"), row=1, col=1)
fig.add_trace(go.Bar(x=labels, y=neg_S, name="sources"), row=1, col=2)
fig.add_trace(go.Bar(x=labels, y=neg_X, name="mixtures"), row=1, col=2)
fig.update_layout(barmode="group", title="Non-Gaussianity proxies (toy example)")
fig.update_xaxes(title_text="component", row=1, col=1)
fig.update_xaxes(title_text="component", row=1, col=2)
fig.show()
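As a sanity check of the sign conventions, here is an illustrative aside with its own simulated data and a local copy of the kurtosis formula. In theory, the Gaussian has excess kurtosis 0, the heavy-tailed Laplace +3, and the light-tailed uniform -1.2:

```python
import numpy as np

rng = np.random.default_rng(1)

def kurt(x):
    # Excess kurtosis of a 1-D sample (same formula as excess_kurtosis above).
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**4) - 3.0)

n = 200_000
print("Gaussian:", round(kurt(rng.standard_normal(n)), 2))      # theory: 0
print("Laplace :", round(kurt(rng.laplace(size=n)), 2))         # theory: +3
print("Uniform :", round(kurt(rng.uniform(-1, 1, size=n)), 2))  # theory: -1.2
```

Note that both heavy- and light-tailed sources count as non-Gaussian: what a contrast function measures is the distance from zero, not the sign.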
3) Algorithm mechanics#
Most ICA algorithms have the same shape:
center: subtract the mean
whiten: remove second-order correlations (Cov becomes identity)
rotate: find directions that maximize independence (usually via non-Gaussianity)
3.1 Whitening#
Whitening transforms the observations so their covariance matrix becomes approximately the identity.
Why it helps:
It removes all second-order structure (correlations).
After whitening, the remaining mixing is (roughly) a rotation.
ICA can then focus on higher-order structure to find independent components.
def whiten(X: np.ndarray, eps: float = 1e-12):
    """Center + whiten so that Cov(X_white) ≈ I."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = (Xc.T @ Xc) / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    D_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals + eps))
    W = eigvecs @ D_inv_sqrt @ eigvecs.T
    X_white = Xc @ W.T
    return X_white, W
X_white, W_white = whiten(X)
cov_X = (X.T @ X) / X.shape[0]
cov_Xw = (X_white.T @ X_white) / X_white.shape[0]
fig = make_subplots(rows=1, cols=2, subplot_titles=("Cov(X) (mixtures)", "Cov(whitened X) (≈ I)"))
fig.add_trace(go.Heatmap(z=cov_X, zmin=-1, zmax=1, colorscale="RdBu"), row=1, col=1)
fig.add_trace(go.Heatmap(z=cov_Xw, zmin=-1, zmax=1, colorscale="RdBu"), row=1, col=2)
fig.update_layout(title="Whitening removes second-order correlations")
fig.show()
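One way to see the "remaining mixing is a rotation" claim concretely: whiteness survives any orthogonal transform, so second-order statistics alone cannot pin down a final rotation, and that leftover rotation is exactly the freedom ICA resolves with non-Gaussianity. A standalone sketch (its own data, with the whitening step inlined):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5_000, 3)) @ rng.normal(size=(3, 3)).T  # correlated data

# Whiten via the eigendecomposition of the covariance (as in whiten() above).
Xc = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
Xw = Xc @ vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

# Any orthogonal R leaves the covariance at the identity:
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random rotation/reflection
for Y in (Xw, Xw @ R.T):
    print(np.round(Y.T @ Y / len(Y), 3))      # both covariances ≈ I
```

Because rotated whitened data is still whitened, the remaining step has to use something beyond covariance, which is where the non-Gaussianity contrast comes in.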
3.2 Maximizing independence (FastICA idea)#
After whitening, ICA searches for weight vectors \(w\) so that the projection \(w^T x\) is maximally non-Gaussian.
A common fixed-point update (one component) looks like:
\[w^{+} \leftarrow \mathbb{E}\big[x \, g(w^\top x)\big] - \mathbb{E}\big[g'(w^\top x)\big]\, w\]
followed by normalization (\(w \leftarrow w^{+} / \lVert w^{+} \rVert\)) and orthogonalization against previously found components.
The nonlinearity \(g\) (e.g. tanh) is chosen so the update increases a non-Gaussianity contrast (a proxy for independence).
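To make the update concrete, here is a from-scratch deflation-style FastICA on its own small problem (a sketch using tanh as \(g\); fastica_deflation is a made-up name for this example, not scikit-learn's API):

```python
import numpy as np

def fastica_deflation(Xw, n_components, max_iter=200, tol=1e-8, seed=0):
    """One-unit FastICA with deflation on *already whitened* data Xw (n, m)."""
    rng = np.random.default_rng(seed)
    n, m = Xw.shape
    W = np.zeros((n_components, m))
    for k in range(n_components):
        w = rng.normal(size=m)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            u = Xw @ w                                   # projections w^T x
            g, g_prime = np.tanh(u), 1.0 - np.tanh(u) ** 2
            w_new = (Xw.T @ g) / n - g_prime.mean() * w  # fixed-point step
            w_new -= W[:k].T @ (W[:k] @ w_new)           # deflate vs found rows
            w_new /= np.linalg.norm(w_new)
            done = abs(abs(w_new @ w) - 1.0) < tol       # converged up to sign?
            w = w_new
            if done:
                break
        W[k] = w
    return W

# Toy check: two non-Gaussian sources, mix, whiten, unmix.
rng = np.random.default_rng(4)
S = np.c_[rng.uniform(-1, 1, 20_000), rng.laplace(size=20_000)]
S = (S - S.mean(0)) / S.std(0)
A = rng.normal(size=(2, 2))
while abs(np.linalg.det(A)) < 0.1:
    A = rng.normal(size=(2, 2))
X = S @ A.T
Xc = X - X.mean(0)
vals, vecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
Xw = Xc @ vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

W_est = fastica_deflation(Xw, 2)
S_hat = Xw @ W_est.T
```

On this toy problem each column of S_hat typically matches one true source up to sign, with absolute correlation close to 1.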
from itertools import permutations
def corr_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Signed correlation between columns of A and B."""
    A0 = A - A.mean(axis=0, keepdims=True)
    B0 = B - B.mean(axis=0, keepdims=True)
    A0 = A0 / (A0.std(axis=0, keepdims=True) + 1e-12)
    B0 = B0 / (B0.std(axis=0, keepdims=True) + 1e-12)
    return (A0.T @ B0) / A0.shape[0]

def match_by_correlation(S_true: np.ndarray, S_est: np.ndarray):
    """Match estimated components to true sources (perm + sign).

    This is only for visualization in toy data where we *know* the sources.
    """
    C = corr_matrix(S_true, S_est)
    k = C.shape[0]
    best_perm = None
    best_score = -np.inf
    for perm in permutations(range(k)):
        score = sum(abs(C[i, perm[i]]) for i in range(k))
        if score > best_score:
            best_score = score
            best_perm = perm
    perm = np.array(best_perm)
    signs = np.sign([C[i, perm[i]] for i in range(k)]).astype(float)
    signs = np.where(signs == 0.0, 1.0, signs)
    S_matched = S_est[:, perm] * signs
    return S_matched, perm, signs, C
pca = PCA(n_components=3, random_state=42)
S_pca = pca.fit_transform(X)
S_pca = StandardScaler().fit_transform(S_pca)
ica = FastICA(n_components=3, random_state=42, whiten="unit-variance", max_iter=2000)
S_ica = ica.fit_transform(X)
S_ica = StandardScaler().fit_transform(S_ica)
S_pca_matched, perm_pca, signs_pca, C_pca_raw = match_by_correlation(S, S_pca)
S_ica_matched, perm_ica, signs_ica, C_ica_raw = match_by_correlation(S, S_ica)
abs_corr_pca = np.abs(corr_matrix(S, S_pca_matched))
abs_corr_ica = np.abs(corr_matrix(S, S_ica_matched))
3.3 Practical ICA with scikit-learn#
sklearn.decomposition.FastICA is a solid default implementation of the FastICA family.
Data layout (common gotcha)
rows = samples (here: time points)
columns = observed mixtures (here: microphone/sensor channels)
Key parameters
n_components: how many independent components to extract
whiten: 'unit-variance' (components scaled to unit variance) vs 'arbitrary-variance' vs False
algorithm: 'parallel' (estimate all components together) vs 'deflation' (one-by-one)
fun: contrast / nonlinearity ('logcosh' default, 'exp', 'cube')
max_iter, tol: convergence control
What you get back
fit_transform(X) returns \(\hat{S}\) (up to permutation/sign/scale)
mixing_ and components_ are the estimated mixing/unmixing matrices (same ambiguities)
In the toy problem we can match components to sources using correlation only because we know the ground truth.
4) Plotly visualizations#
4.1 Mixed vs unmixed signals#
In real life you don’t have access to the true sources — but in a toy dataset we can show the idea clearly.
fig = make_subplots(
    rows=3,
    cols=3,
    shared_xaxes=True,
    vertical_spacing=0.06,
    column_titles=("Sources (truth)", "Mixtures (observed)", "ICA (unmixed)"),
)
for i in range(3):
    fig.add_trace(go.Scatter(x=t, y=S[:, i], mode="lines", line=dict(width=1)), row=i + 1, col=1)
    fig.add_trace(go.Scatter(x=t, y=X[:, i], mode="lines", line=dict(width=1)), row=i + 1, col=2)
    fig.add_trace(
        go.Scatter(x=t, y=S_ica_matched[:, i], mode="lines", line=dict(width=1)),
        row=i + 1,
        col=3,
    )
for i in range(3):
    fig.update_yaxes(title_text=f"{i + 1}", row=i + 1, col=1)
    fig.update_yaxes(title_text=f"{i + 1}", row=i + 1, col=2)
    fig.update_yaxes(title_text=f"{i + 1}", row=i + 1, col=3)
fig.update_xaxes(title_text="time", row=3, col=1)
fig.update_xaxes(title_text="time", row=3, col=2)
fig.update_xaxes(title_text="time", row=3, col=3)
fig.update_layout(title="Mixed vs unmixed signals (toy example)", showlegend=False, height=650)
fig.show()
4.2 PCA vs ICA comparison#
PCA and ICA both produce linear components, but they optimize different objectives:
PCA: directions of maximum variance (components are orthogonal and uncorrelated).
ICA: directions of maximum independence (using non-Gaussianity).
In this toy problem, ICA aligns components with sources much better than PCA.
labels_sources = ["s1", "s2", "s3"]
labels_components = ["c1", "c2", "c3"]
fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("PCA: |corr(component, source)|", "ICA: |corr(component, source)|"),
)
fig.add_trace(
    go.Heatmap(
        z=abs_corr_pca.T,
        x=labels_sources,
        y=labels_components,
        zmin=0,
        zmax=1,
        colorscale="Blues",
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Heatmap(
        z=abs_corr_ica.T,
        x=labels_sources,
        y=labels_components,
        zmin=0,
        zmax=1,
        colorscale="Blues",
    ),
    row=1,
    col=2,
)
fig.update_layout(title="PCA vs ICA on the same mixtures", height=420)
fig.show()
5) Use cases#
ICA is most useful when “mixtures of independent-ish things” is a good model.
Signal processing#
Audio source separation (cocktail party) when mixing is approximately linear.
Denoising / artifact removal: separate structured signal from independent noise-like components.
Finance#
Discover latent factors behind correlated asset returns (market/sector/style components).
Separate independent drivers for risk modeling or portfolio construction.
Neuroscience#
EEG/MEG: remove artifacts (eye blinks, muscle activity) by identifying independent sources.
fMRI: identify spatially independent activation patterns (often with additional constraints).
Practical pitfalls#
ICA is sensitive to preprocessing (centering, whitening, scaling).
Components are only defined up to permutation/sign/scale.
If the independence/non-Gaussian assumptions are wrong, ICA can return unstable or uninterpretable components.
6) Exercises + references#
Exercises#
Make one source Gaussian and see when ICA becomes ambiguous.
Increase the number of sources/microphones and check when recovery degrades.
Compare kurtosis/negentropy of PCA vs ICA components on the same mixtures.
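A starter sketch for the first exercise (standalone; best_match_corrs is a made-up helper name): with two Gaussian sources, every rotation of the whitened data is equally "independent", so the recovered components need not align with the truth, while a non-Gaussian pair is typically recovered almost perfectly.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(5)

def best_match_corrs(S):
    """Mix S (n, 2) with a random well-conditioned matrix, run FastICA,
    and return each true source's best |correlation| with any estimate."""
    S = (S - S.mean(0)) / S.std(0)
    A = rng.normal(size=(2, 2))
    while abs(np.linalg.det(A)) < 0.1:
        A = rng.normal(size=(2, 2))
    X = S @ A.T
    ica = FastICA(n_components=2, random_state=0, whiten="unit-variance", max_iter=2000)
    S_hat = ica.fit_transform(X)
    C = np.abs(np.corrcoef(S.T, S_hat.T))[:2, 2:]
    return C.max(axis=1)

n = 20_000
# Non-Gaussian sources: recoverable (|corr| typically close to 1).
print(best_match_corrs(np.c_[rng.uniform(-1, 1, n), rng.laplace(size=n)]))
# Two Gaussian sources: any rotation of the whitened data is equally white
# and Gaussian, so alignment with the truth is essentially arbitrary.
print(best_match_corrs(rng.normal(size=(n, 2))))
```

Making only one source Gaussian is the interesting middle case: the non-Gaussian sources are still identifiable, and the single Gaussian one is pinned down as "whatever is left over".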
References#
Hyvärinen & Oja (2000): Independent Component Analysis: Algorithms and Applications
scikit-learn docs: sklearn.decomposition.FastICA
Cardoso (1998): Blind signal separation: statistical principles